Add robust parallel trends testing with Wasserstein distance#3

Merged: igerber merged 1 commit into main from claude/init-did-library-pvNmf on Jan 1, 2026

@igerber igerber commented Jan 1, 2026

No description provided.

- Add check_parallel_trends_robust() using Wasserstein (Earth Mover's) distance
  for distributional comparison of pre-treatment outcome changes
- Include permutation-based p-value for statistical inference
- Add Kolmogorov-Smirnov test as complementary distributional test
- Add equivalence_test_trends() using TOST procedure
- Compute normalized Wasserstein and variance ratio diagnostics
- Add 9 new tests for robust parallel trends functionality
- Update README with usage examples for all three approaches
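
The TOST procedure named in the list above can be sketched as follows. This is a minimal illustration, not the library's `equivalence_test_trends()` implementation: the function name `tost_mean_difference`, the `margin` argument, and the large-sample normal approximation (used here in place of a t distribution) are all assumptions.

```python
import math
from statistics import NormalDist, mean, variance


def tost_mean_difference(treated_changes, control_changes, margin, alpha=0.05):
    """Two one-sided tests for |mean difference| < margin (normal approximation).

    Hypothetical helper for illustration only.
    """
    diff = mean(treated_changes) - mean(control_changes)
    se = math.sqrt(
        variance(treated_changes) / len(treated_changes)
        + variance(control_changes) / len(control_changes)
    )
    z = NormalDist()
    p_lower = 1.0 - z.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = z.cdf((diff - margin) / se)        # H0: diff >= +margin
    p_value = max(p_lower, p_upper)              # both one-sided tests must reject
    return {"diff": diff, "p_value": p_value, "equivalent": p_value < alpha}
```

Equivalence is declared only when both one-sided nulls are rejected, which is why the reported p-value is the larger of the two.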

The Wasserstein distance is more robust than simple slope comparisons
because it captures differences in the full distribution shape, not
just means, making it better suited for heterogeneous effects.
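
A minimal sketch of the idea, assuming the test compares the treated and control distributions of pre-treatment outcome changes and permutes group labels to obtain a p-value. The function names `wasserstein_1d` and `permutation_pvalue` are hypothetical and do not reflect the signature of `check_parallel_trends_robust()`.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Empirical 1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    u, v = np.sort(u), np.sort(v)
    grid = np.sort(np.concatenate([u, v]))
    widths = np.diff(grid)
    cdf_u = np.searchsorted(u, grid[:-1], side="right") / len(u)
    cdf_v = np.searchsorted(v, grid[:-1], side="right") / len(v)
    return float(np.sum(np.abs(cdf_u - cdf_v) * widths))


def permutation_pvalue(treated, control, n_permutations=999, seed=0):
    """Permutation p-value for the observed distance (labels shuffled under H0)."""
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    observed = wasserstein_1d(treated, control)
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)
        if wasserstein_1d(perm[:n_t], perm[n_t:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)
```

Because the distance integrates the gap between the full empirical CDFs, two groups with equal mean changes but different spreads still produce a nonzero statistic.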
@igerber igerber merged commit d86b108 into main Jan 1, 2026
@igerber igerber deleted the claude/init-did-library-pvNmf branch January 3, 2026 12:52
igerber pushed a commit that referenced this pull request Jan 4, 2026
Revised review reflects:
- #1, #4 verified as non-issues (correct by design)
- #3, #5, #6, #8, #13 addressed in commit e40d6b4
- Updated recommendation to approve and merge
- Remaining items are low-priority style suggestions for future PRs
igerber added a commit that referenced this pull request Apr 17, 2026
- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call
  it from fit() after the weight_type/replicate-weights checks. The
  dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as
  the effective sampling unit; when strata or PSU vary within group it
  silently spreads horizon-specific IF mass across observations in
  different PSUs, contaminating the stratified-PSU variance. Walk back
  the overstated claim at the old line 669 comment to match. Within-
  group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before
  np.unique/np.bincount so NaN / non-comparable group IDs on excluded
  subpopulation rows cannot crash SE factorization. psi stays full-
  length with zeros in excluded positions to preserve alignment with
  resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the
  within-group-constant strata/PSU requirement and the
  within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects
  varying PSU, rejects varying strata, accepts varying weights, and
  ignores zero-weight rows during the constancy check) plus
  TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.
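
The expansion psi_i = U[g] * (w_i / W_g) with the zero-weight filter described above might look like the sketch below. The helper name `expand_group_if`, the use of numeric group IDs with NaN on excluded rows, and the convention that `U` is ordered by sorted unique group ID are assumptions made for illustration.

```python
import numpy as np


def expand_group_if(U, group_ids, weights):
    """Spread each group's influence-function value over its rows by weight share.

    Hypothetical helper: zero-weight rows are dropped before np.unique /
    np.bincount, and psi keeps full length with zeros in excluded positions.
    """
    group_ids = np.asarray(group_ids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = weights > 0  # excluded subpopulation rows never reach np.unique
    _, inverse = np.unique(group_ids[keep], return_inverse=True)
    W_g = np.bincount(inverse, weights=weights[keep])  # total weight per group
    psi = np.zeros(len(weights))
    psi[keep] = np.asarray(U, dtype=float)[inverse] * weights[keep] / W_g[inverse]
    return psi
```

Within each group the expanded values sum back to U[g], and the output stays aligned row-for-row with any strata or PSU arrays of the same length.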

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.
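
The marker-based rewrite can be sketched as below. This is not the contents of gen_findings_tables.py: the function name `replace_table` is hypothetical, and the exact marker spacing is assumed from the description above.

```python
import re


def replace_table(markdown, table_id, new_table):
    """Replace the text between TABLE:start/TABLE:end markers for one table id."""
    tid = re.escape(table_id)
    pattern = re.compile(
        r"(<!-- TABLE:start " + tid + r" -->\n).*?(<!-- TABLE:end " + tid + r" -->)",
        re.DOTALL,
    )
    if pattern.search(markdown) is None:
        raise KeyError(f"no markers found for table id {table_id!r}")
    # A callable replacement avoids backslash-escape surprises in table text.
    return pattern.sub(
        lambda m: m.group(1) + new_table.rstrip("\n") + "\n" + m.group(2),
        markdown,
        count=1,
    )
```

Everything outside the markers is left untouched, which is what keeps the narrative prose hand-written while the numbers regenerate.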

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
P1 #1 — Stute tie-safe CvM:
Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each
observation's dose, so tied observations share the post-tie cumulative sum.
My naive cumsum over sorted residuals produced partial within-tie sums that
were row-order-dependent. Fix: after cumsum, replace within-tie-block values
with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now
accepts `d_sorted` and collapses tie blocks before squaring. Regression
test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance
on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses`
validates the end-to-end stute_test contract.
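
A minimal sketch of the tie-block collapse described above, assuming the doses are already sorted and the residuals are in the same order. The helper name `tie_safe_cumsum` is hypothetical; the real `_cvm_statistic` may differ in signature and normalization.

```python
import numpy as np


def tie_safe_cumsum(d_sorted, eps_sorted):
    """Cumulative residual sums where tied doses share the post-tie value."""
    d_sorted = np.asarray(d_sorted, dtype=float)
    c = np.cumsum(np.asarray(eps_sorted, dtype=float))
    # d_sorted is sorted, so each unique dose is one contiguous block and
    # np.unique returns the block sizes in the same order.
    _, counts = np.unique(d_sorted, return_counts=True)
    last_idx = np.cumsum(counts) - 1  # index of the last row in each tie block
    return np.repeat(c[last_idx], counts)
```

Swapping the residuals of two rows with the same dose changes the naive cumulative sum but leaves this result unchanged.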

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN):
For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome
is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix:
dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally
produces p=1 when eps=0 exactly). Added a scale-relative short-circuit
(sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so
FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat
the short-circuit by producing non-zero but tiny OLS residuals. Yatchew
exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than
NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan,
TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.
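
The scale-relative check can be illustrated with a small sketch. The function name `is_exact_linear_fit` and the use of `np.linalg.lstsq` for the OLS fit are assumptions; only the inequality sum(eps²) ≤ 1e-24 · sum(dy²) comes from the description above.

```python
import numpy as np


def is_exact_linear_fit(d, dy, rel_tol=1e-24):
    """True when OLS residuals of dy on (1, d) are zero up to floating-point noise."""
    d = np.asarray(d, dtype=float)
    dy = np.asarray(dy, dtype=float)
    X = np.column_stack([np.ones_like(d), d])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    eps = dy - X @ beta
    # Comparing against sum(dy**2) makes the threshold independent of units.
    return bool(np.sum(eps**2) <= rel_tol * np.sum(dy**2))
```

An absolute check such as `var(eps) <= 0` would miss residuals of order 1e-16, whereas the relative threshold absorbs them while still rejecting any genuine curvature.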

P1 #3 — HADPretestReport.all_pass contract:
Previously `all_pass = not (reject or reject or reject)` could be True
while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every
constituent p-value being finite AND no test rejecting. Updated docstring.
Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
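
The corrected gate amounts to the following, shown here as a standalone sketch; the function name and argument layout are hypothetical rather than the actual `HADPretestReport` attributes.

```python
import math


def all_pass(p_values, rejects):
    """True only if every p-value is finite and no constituent test rejects."""
    return all(math.isfinite(p) for p in p_values) and not any(rejects)
```

With this gate a NaN p-value forces `all_pass` to False, so it can no longer disagree with an "inconclusive" verdict.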

P2 #1 — QUG negative-dose guard:
HAD doses must be non-negative (paper Section 2). The raw qug_test API
was silently folding d < 0 rows into the n_excluded_zero counter (filter
was `d > 0`). Fix: front-door ValueError on any d < 0. Regression:
TestQUGTest.test_negative_dose_raises.

P3 #1 — QUG np.partition:
REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched
qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤
partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight
closed-form parity at atol=1e-12 still holds.
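
A short sketch of extracting the two smallest positive doses with `np.partition`, combined with the negative-dose guard from P2 #1. The helper name `two_smallest_positive_doses` and its error messages are assumptions, not the `qug_test` API.

```python
import numpy as np


def two_smallest_positive_doses(d):
    """Return (D_(1), D_(2)) among strictly positive doses without a full sort."""
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError("doses must be non-negative")
    d_nz = d[d > 0]  # zero doses are excluded, negatives were rejected above
    if d_nz.size < 2:
        raise ValueError("need at least two positive doses")
    part = np.partition(d_nz, 1)  # part[0] <= part[1] <= everything after
    return float(part[0]), float(part[1])
```

`np.partition(d_nz, 1)` places the second-smallest value at index 1 and anything smaller before it, so index 0 must hold the minimum.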

P3 #2 — REGISTRY n_bootstrap default:
REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated
REGISTRY to match code and added a note about the n_bootstrap >= 99
front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
dCDH by_path + placebo: per-path backward-horizon placebos (Wave 2 #3)
igerber added a commit that referenced this pull request Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust:
The parity contract runs through nprobust numerical paths
(DIDHAD's local-linear bandwidth + bias-correction calls), so a
fresh regeneration could drift if CRAN serves a newer nprobust.
Pin nprobust == 0.5.0 in both the R generator's stopifnot guard
and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.

P3 #2 — workflow docstring:
did_had_pretest_workflow's top-level docstring still said "Eq 18
linear-trend detrending is a Phase 4 follow-up" which contradicts
the shipped trends_lin behavior. Updated to describe the
forwarding contract (trends_lin → joint_pretrends_test +
joint_homogeneity_test, consumed-placebo skip path on minimal
panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity test horizon-shape assertions:
Added an explicit "missing in Python" assertion in _zip_r_python:
every R-mapped event time must be present in Python's event_times
(catches future horizon-shape regressions where Python silently
drops a horizon R requested). Added an effects+placebo row-count
sanity check in test_yatchew_t_stat_parity (uses the previously-
unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 14, 2026
R1 P1 #1: pin remaining dismissal invariants
- Comment block claims 4 invariants hold but only invariants #1 (no
  execution) and #2 (fork-skip) had test coverage. Add 3 tests:
  - test_workflow_codex_step_uses_read_only_sandbox (invariant #1
    other half: sandbox: read-only)
  - test_workflow_resolve_pr_sets_head_sha_from_api (invariant #4:
    head_sha API-pinned, not from event payload)
  - test_workflow_comment_triggers_require_author_association
    (invariant #3: comment triggers gated on
    OWNER/MEMBER/COLLABORATOR)

R1 P1 #2: make guard test fail-closed across run scalar styles
- Prior regex only matched `run: |` literal blocks; inline `run: pytest`
  and folded `run: >` bypassed the scan entirely.
- Extract _extract_all_run_content static method that handles all three
  scalar styles (literal `|` with chomping variants, folded `>` with
  variants, and inline single-line). Both existing tests and a new
  python-file-exec test now use it.
- Expand FORBIDDEN_EXECUTION_PATTERNS to include `pip3 install` and
  `npm ci` (reviewer-named omissions).
- Add test_workflow_no_python_file_execution_against_workspace: regex
  flags `python(3)? <path>.py` invocations against workspace-relative
  paths (PR-head bytes), allowlists /tmp/-prefixed paths (BASE-staged
  via git show). Inline scripts (-c) and module invocations (-m) don't
  capture .py tokens, naturally excluded.

Test-the-test verified inline + folded + literal + npm ci + python
workspace all fire; python /tmp/ correctly does not. All 24 workflow
tests pass.
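
The python-file-execution check can be sketched with a single regex. The pattern and the helper name `flagged_python_paths` below are illustrative guesses based on the description above, not the test's actual code.

```python
import re

# Matches `python <path>.py` or `python3 <path>.py`; the first token after the
# interpreter must end in .py, so `-c` and `-m` invocations never match.
PY_FILE_EXEC = re.compile(r"\bpython3?\s+(\S+\.py)\b")


def flagged_python_paths(run_content):
    """Return .py paths executed from the workspace; /tmp/ paths are allowlisted."""
    return [
        path
        for path in PY_FILE_EXEC.findall(run_content)
        if not path.startswith("/tmp/")
    ]
```

A guard test would then assert that this list is empty for every `run` scalar extracted from the workflow file.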
igerber added a commit that referenced this pull request May 20, 2026
…prose

L194 checklist update (this PR) said the Eq. 18 detrending variant
shipped in PR #389; the explanatory prose immediately below at L200
still said "Phase 4 extends it with the Eq (18) detrending" as if it
were future work. Rewritten to past tense matching the L194 closure
and the REGISTRY § "Note (Phase 4 — Eq 17 / Eq 18 linear-trend
detrending shipped)" framing. Only the Pierce-Schott numerical
replication remains waived (REGISTRY Deviations Note #3).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request May 22, 2026
Local Codex review on commit 79e0962 returned ✅ with 3 P3s (all
documentation/coverage, no actionable P0/P1). Per the test-coverage P3
upgrade rule (feedback_test_coverage_gap_treat_as_actionable.md),
addressing all three:

P3 #1 (Code Quality): `_compute_cr2_bm_contrast_dof` was missing the
`ndim` validation that the parallel one-way `_compute_bm_dof_from_contrasts`
helper has, so a stray `(k,)` 1-D vector would die with a low-level
indexing error instead of a contract error. Added the same shape-tuple
check pattern (`if contrasts.ndim != 2 or contrasts.shape[0] != k`).

P3 #2 (Docs): two stale doc surfaces post-feature-lift:
  - `estimators.py:68-71` base estimator docstring still said MPD did
    NOT support cluster + hc2_bm. Rewrote to describe the new
    cluster-aware contrast-DOF support and flag survey CR2-BM as the
    remaining gate.
  - `tests/test_linalg_hc2_bm.py` module banner still said clustered
    CR2 BM was "deferred to a follow-up". Updated to describe both the
    per-coefficient and the new compound-contrast DOF surfaces, and
    narrow the deferral to the weighted CR2-BM case only.

P3 #3 (Tests): the new MPD test only asserted finite output, so a
regression that silently fell back to the shared n-k DOF would still
pass. Added `test_multi_period_cluster_hc2_bm_avg_att_uses_clubsandwich_dof`
which fits MPD on the new R `mpd_clustered_avg_att_dof` fixture and
recovers the implied Satterthwaite DOF by inverting
`avg_p_value = 2 * (1 - t.cdf(|avg_t_stat|, df))` via scipy.brentq. The
recovered DOF must match the R `Wald_test(test="HTZ")$df_denom` at
atol=1e-6. Also pins that the implied DOF is much smaller than the
n-k fallback (~39 here) — catches a regression to the shared df path.
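
Recovering a Satterthwaite DOF from a reported t-statistic and p-value can be sketched as below, assuming SciPy is available. The function name `implied_dof` and the search bracket are assumptions; `t.sf` is used in place of `1 - t.cdf` for better tail accuracy, which is an equivalent formulation.

```python
from scipy.optimize import brentq
from scipy.stats import t


def implied_dof(t_stat, p_value, lo=1.0, hi=1e4):
    """Solve 2 * (1 - t.cdf(|t_stat|, df)) = p_value for df on [lo, hi]."""

    def gap(df):
        return 2.0 * t.sf(abs(t_stat), df) - p_value

    # The two-sided p-value decreases monotonically in df for a fixed
    # nonzero t-statistic, so the root in the bracket is unique.
    return brentq(gap, lo, hi, xtol=1e-10)
```

If the estimator had silently fallen back to a larger shared DOF, the recovered value would sit far from the reference value and the comparison would fail.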

All 254 tests in tests/test_linalg_hc2_bm.py + test_estimators_vcov_type.py
+ test_estimators.py pass; lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request May 22, 2026
Add tests/test_methodology_had.py (6 classes, 34 tests) with paper-
equation-numbered Verified Components walk-through against de
Chaisemartin, Ciccia, D'Haultfoeuille & Knau (2026) arXiv:2405.04465v6
covering Equations 3 / 7 / 11 / 18 / 29 and Theorems 1 / 3 / 4 / 7:

- TestHADTheorem1Design1Prime: Eq. 3 Design 1' WAS recovery + N(0,1)
  coverage check at n_replicates=200, G=1000 with KS-stat <= 0.05 and
  empirical 95% coverage >= 0.90
- TestHADTheorem3MassPoint: Eq. 11 / Theorem 3 mass-point WAS_{d_lower}
  recovery + Wald-IV closed-form equivalence at atol=1e-9
- TestHADTheorem4QUG: Theorem 4 limit-law distributional match against
  closed-form F(t) = t/(1+t) at KS-stat <= 0.05, n_draws=5000, G=2000
- TestHADTheorem7YatchewHR: Eq. 29 standard-normal limit, paper-literal
  sigma2_diff = 1/(2G) normalization lock
- TestHADJointStute: Section 4.2 step 2 + 4.3 mean-independence variant
  H0 fail-to-reject + H1 reject under nonlinear DGP
- TestHADDeviations: equal-weighting invariance, sup-t bootstrap gating,
  staggered-timing fail-closed ValueError, safe_inference joint NaN
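
For intuition about the F(t) = t/(1+t) check, the sketch below simulates one idealized setting in which that law holds exactly: the ratio of two independent standard exponentials. It is not the library's `qug_test` and does not draw finite-G dose samples; the function name and the defaults other than `n_draws=5000` are assumptions.

```python
import numpy as np


def ks_stat_against_limit_law(n_draws=5000, seed=0):
    """KS distance between simulated E1/E2 ratios and F(t) = t / (1 + t)."""
    rng = np.random.default_rng(seed)
    ratio = np.sort(rng.exponential(size=n_draws) / rng.exponential(size=n_draws))
    F = ratio / (1.0 + ratio)  # closed-form CDF evaluated at the sorted draws
    i = np.arange(1, n_draws + 1)
    d_plus = np.max(i / n_draws - F)         # empirical CDF above F
    d_minus = np.max(F - (i - 1) / n_draws)  # empirical CDF below F
    return float(max(d_plus, d_minus))
```

With 5000 draws the statistic should fall well under the 0.05 threshold quoted above, since the sampling noise of an empirical CDF at that sample size is much smaller.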

Add Assumption 5/6 non-testability documentation:
- HeterogeneousAdoptionDiD class docstring: new "Non-testable assumptions
  (paper Section 3.1.2)" Notes block citing Section 3.1.2 + cross-
  referencing the existing fit-time UserWarning at had.py:3372-3390
- qug_test / stute_test / yatchew_hr_test / did_had_pretest_workflow:
  "Scope (what this test does NOT cover)" clauses in Notes sections
  explicitly stating tests verify ADJACENT assumptions (4 / 7 / 8) and
  CANNOT test Assumptions 5 or 6

Close paper-review checklist L182-L194 + REGISTRY HAD Implementation
Checklist L2602-L2604: Phase 1a/1b/1c implementation closures (panel
validator, design paths, local-linear backend, bias-corrected CI),
staggered-timing fail-closed ValueError, zero-dose UserWarning filter,
Assumption 5/6 non-testability documentation. L2604 (covariates=
Theorem 6 NotImplementedError) remains [ ] with explicit TODO.md
cross-reference (currently a Python TypeError, fail-closed).

Waive Phase-4 validation-harness items #1 (Pierce-Schott 2016 Figure 2)
+ #2 (Table 1 coverage rates) with documented rationale: R parity at
atol=1e-8 in test_did_had_parity.py (3 DGPs x 5 method combos, bit-exact
via rtol=0) is a strictly stronger correctness anchor than coverage-rate
MC. Paper Section 5.2 itself self-acknowledges NP estimators too noisy
to be informative on the LBD-restricted PNTR panel.

REGISTRY HAD section gains a consolidated Deviations block (5 entries
with framing header distinguishing Notes #1-#2 = implementation choices
from Notes #3-#4 = waived validation-harness work from #5 = Library
extension for staggered-timing fail-closed). Existing scattered Note
entries at L2313 (equal-weighting) and L2398 (sup-t gating) referenced
from the new block.

METHODOLOGY_REVIEW.md HAD row promoted In Progress -> Complete, detail
section rewritten with Verified Components / Test Coverage / Corrections
Made / Deviations / Outstanding Concerns structure mirroring the Bacon /
TripleDifference Complete-row layout.

TODO.md: existing Phase 4 Pierce-Schott row annotated with the 2026-05-20
waiver decision + rationale; new follow-up row for covariates= Theorem 6
NotImplementedError + Theorem 6 pointer (Low priority).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request May 22, 2026
- P3 (Methodology): the promoted HAD materials described the Eq. 17/18
  `trends_lin=True` linear-trend-detrended variant as "deferred per Phase 4".
  This conflated TWO different things: (a) the FEATURE — which is shipped
  via the `trends_lin: bool = False` keyword-only kwarg on HAD.fit(),
  joint_pretrends_test, and joint_homogeneity_test (PR igerber#389; R-parity locked
  against DIDHAD::did_had(trends_lin=TRUE) v2.0.0 in test_did_had_parity.py);
  and (b) the PIERCE-SCHOTT NUMERICAL REPLICATION against the published
  p=0.51 anchor on the LBD-restricted panel, which IS waived per REGISTRY
  Deviations Note #3. Updated 3 surfaces (paper-review L194, METHODOLOGY_REVIEW
  Eq. 18 Verified-Components row, test_methodology_had.py module docstring +
  TestHADJointStute class docstring) to distinguish "feature shipped + R-parity
  locked elsewhere" from "Pierce-Schott numerical replication waived".

- P3 (Documentation/Tests): TestHADJointStute promotion narrative overstated
  H1 coverage as "H0 fail-to-reject and H1 reject on linear vs nonlinear DGPs"
  for both joint_pretrends_test and joint_homogeneity_test. Reality: H1
  rejection is tested only on joint_homogeneity_test via a quadratic post-
  DGP; joint_pretrends_test gets H0-only coverage in this file (H1 would
  require a violating-pretrends fixture that re-verifies bootstrap calibration
  covered by test_had_pretests.py). Narrowed wording in METHODOLOGY_REVIEW
  Verified-Components row + TestHADJointStute class docstring; CHANGELOG entry
  unchanged (the H1 reject claim in CHANGELOG explicitly cites the homogeneity
  side via "H1 reject under nonlinear DGP", which is accurate).

All 35 methodology tests pass; lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request May 22, 2026
…GELOG H1 scope

- R6 fix left METHODOLOGY_REVIEW.md Deviations item #6 stale (only updated
  the Verified-Components row). Item #6 still said "Eq. 18 linear-trend-
  detrended joint Stute deferred". Rewritten to match the rest of the
  HAD tracker: trends_lin=True is SHIPPED + R-parity-locked in
  test_did_had_parity.py; the methodology-walkthrough file deliberately
  doesn't duplicate that coverage; the Pierce-Schott published-value
  numerical replication is what's waived (Deviations Note #3).

- R6 narrowed the Verified-Components row + class docstring but missed the
  CHANGELOG bullet, which still claimed "joint Stute pre-trends + homogeneity
  H0 fail-to-reject + H1 reject under nonlinear DGP". Narrowed to:
  "H0 fail-to-reject on both surfaces and H1 reject for joint homogeneity
  under a nonlinear DGP" — matches the test file's actual scope.

All 35 methodology tests pass; lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 25, 2026
… to_dict()

ImputationDiDResults now exposes `vcov_type`, `cluster_name`, `n_clusters`
and a new `to_dict()` method (Phase 1b interstitial #3), but the shared
"Common Results Pattern for Staggered Estimators" section in llms-full.txt
still listed only `summary()`, `print_summary()`, and `to_dataframe()`.
Adds a variance-metadata table and threads `to_dict()` into the Methods
line so AI-agent guide consumers can discover the surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 25, 2026
Phase 1b interstitial #3 for ImputationDiD. Mirrors the CallawaySantAnna
(PR #487) + TripleDifference (PR #488) template for IF-based estimators:
vcov_type is permanently narrow to {"hc1"} because the per-unit influence
function aggregation (Borusyak-Jaravel-Spiess 2024 Theorem 3) has no
single design matrix on which hat-matrix leverage or Bell-McCaffrey
Satterthwaite DOF can be defined.
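
A narrow contract of this kind can be enforced with a small validator. The sketch below is an assumption about shape only: the standalone function name `validate_vcov_type`, the choice of `ValueError`, and the message text are not taken from the library.

```python
ALLOWED_VCOV_TYPES = frozenset({"hc1"})


def validate_vcov_type(vcov_type):
    """Reject any variance estimator outside the narrow {"hc1"} contract."""
    if vcov_type not in ALLOWED_VCOV_TYPES:
        raise ValueError(
            f"vcov_type={vcov_type!r} is not supported for this estimator; "
            f"allowed values: {sorted(ALLOWED_VCOV_TYPES)}"
        )
    return vcov_type
```

Calling the same check again at fit time guards against a value changed after construction, for example through `set_params`.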

Source surface:
- diff_diff/imputation.py: vcov_type param + @staticmethod
  _validate_vcov_type + fit()-time revalidation +
  cluster+replicate-weights NotImplementedError guard +
  Results cluster_name/n_clusters resolution
- diff_diff/imputation_results.py: vcov_type/cluster_name/n_clusters
  fields + new to_dict() + variance-estimator line in summary() routing
  through shared _format_vcov_label helper
- diff_diff/imputation_bootstrap.py: dual-site n_clusters<2 /
  n_psu<2 NaN guards via new _build_nan_bootstrap_results helper
  (closes the BLAS-roundoff zero-SE class predicted to recur on
  IF-based estimators)

Tests: 34 new tests in TestImputationDiDVcovType covering default /
cluster / TSL-survey / replicate-survey bit-equality (parametrized over
aggregate modes), bootstrap × cluster + bootstrap × survey bit-equality,
fit()-time revalidation after set_params bypass, bootstrap n_psu<2 /
n_clusters<2 NaN propagation, pretrends bit-equality, and the full
introspection + safety-gate surface (8 tests).

Docs: REGISTRY.md (IF-based taxonomy + 4 new Notes), CHANGELOG.md,
TODO.md (row narrowed, Conley follow-up added), llms-full.txt
(vcov_type + pretrends signature drift fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 25, 2026
… 3 (Phase 1b interstitial #3)

Phase 1b interstitial #3 for ImputationDiD. Mirrors the CallawaySantAnna
(PR #487) + TripleDifference (PR #488) template for IF-based estimators:
vcov_type is permanently narrow to {"hc1"} because the per-unit influence
function aggregation (Borusyak-Jaravel-Spiess 2024 Theorem 3) has no
single design matrix on which hat-matrix leverage or Bell-McCaffrey
Satterthwaite DOF can be defined.

Source surface (diff_diff/):
- imputation.py: vcov_type param + @staticmethod _validate_vcov_type +
  fit()-time revalidation + cluster+replicate-weights NotImplementedError
  guard + Results metadata resolution (cluster_name=unit by default for
  the Theorem 3 unit-clustered IF variance; suppressed under ANY survey
  design — analytical OR replicate — because replicate variance ignores
  cluster/PSU entirely)
- imputation_results.py: vcov_type/cluster_name/n_clusters fields, new
  to_dict() method, summary() variance line via shared _format_vcov_label
  (default cluster=None renders "CR1 cluster-robust at <unit>, G=<n>";
  bootstrap fits suppress the analytical label and render
  "Inference method: bootstrap" instead, mirroring DiDResults.summary()
  gate at results.py:213-226)
- imputation_bootstrap.py: dual-site n_clusters<2 / n_psu<2 NaN guards
  via new _build_nan_bootstrap_results helper (closes the BLAS-roundoff
  zero-SE class predicted to recur on IF-based estimators)

Tests: 42 new tests in TestImputationDiDVcovType covering default /
cluster / TSL-survey / replicate-survey + bootstrap × cluster + bootstrap
× survey bit-equality (ALL parametrized over aggregate ∈ {None,
"event_study", "group"} with per-horizon and per-group SE override
branches pinned); fit()-time revalidation after set_params bypass;
bootstrap n_psu<2 / n_clusters<2 NaN propagation including coef_var NaN;
pretrends=True × vcov_type='hc1' × cluster bit-equality; introspection
(default attr, get_params, Results carries, to_dict, summary label
default+cluster+bootstrap-suppressed, cluster_name suppression under
both analytical AND replicate survey, fit-clone idempotence,
convenience function); input rejection on classical/hc2/hc2_bm/conley/
unknown with distinct methodology-keyword pins; cluster+replicate
rejection. Full pytest tests/test_imputation.py: 128 passed.

Docs:
- REGISTRY.md: IF-based taxonomy adds ImputationDiD to "Enforced today"
  tier; ImputationDiD section gains 4 new Notes (vcov_type contract,
  cluster+replicate fail-closed, bootstrap n<2 NaN, default unit-cluster
  CR1 rendering)
- CHANGELOG.md: [Unreleased] entry
- TODO.md: Phase 1b row narrowed to TwoStageDiD + EfficientDiD;
  ImputationDiD Conley follow-up row added
- guides/llms-full.txt: vcov_type + pretrends signature drift fix +
  shared staggered-results section advertises new variance metadata
  fields and to_dict()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 25, 2026
…-phase1b

ImputationDiD: thread vcov_type as narrow {hc1} contract per BJS Theorem 3 (Phase 1b interstitial #3)

2 participants